@prosdev prosdev commented Jan 14, 2026

Summary

This PR completes the transition to GCS + BigQuery as the production-ready storage backend and adds comprehensive Prometheus metrics for observability.

Changes

1. Switch to GCS Default & Remove Firestore (Issue #25)

  • Changed default EVENTKIT_EVENT_STORE to "gcs"
  • Removed all Firestore code, tests, and configuration
  • Updated dependencies and documentation
  • Simplified dependency injection (single storage backend)

2. Prometheus Metrics (Bonus)

  • Core Infrastructure: Metrics server on port 9090 (separate from API)
  • API Metrics: Request count, latency histograms
  • Event Processing: Received, processed, failed counters
  • Storage: Bytes written, files written
  • Queue & Ring Buffer: Enqueue, dequeue, depth, publish success/failure
  • Warehouse Loader: Files processed, errors
  • PubSub Queue: Full instrumentation for distributed deployments

3. Testing & CI

  • E2E test script validates entire pipeline
  • CI optimized (4.5x speedup with pytest-xdist)
  • All 238 tests passing
  • Metrics endpoint verification in E2E

Documentation

  • README updated with metrics section
  • Example Grafana queries
  • Configuration options documented

Migration Notes

  • Breaking: EVENTKIT_EVENT_STORE=firestore no longer supported
  • Action Required: Use GCS or implement custom EventStore
  • New Config: EVENTKIT_METRICS_ENABLED (default: true), EVENTKIT_METRICS_PORT (default: 9090)
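The migration notes above could translate into a startup check along these lines. This is a minimal sketch, assuming the settings are read from environment variables. `Settings`, `load_settings`, and the accepted truthy strings are hypothetical and not taken from eventkit's code.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    event_store: str
    metrics_enabled: bool
    metrics_port: int


def load_settings(env=None):
    """Read the three settings named in the migration notes (hypothetical helper)."""
    env = os.environ if env is None else env

    store = env.get("EVENTKIT_EVENT_STORE", "gcs").strip().lower()
    if store == "firestore":
        # Firestore support is removed by this PR, so fail fast at startup.
        raise ValueError(
            "EVENTKIT_EVENT_STORE=firestore is no longer supported; "
            "use 'gcs' or implement a custom EventStore"
        )

    # Assumption: these strings count as "enabled"; the real parser may differ.
    raw_enabled = env.get("EVENTKIT_METRICS_ENABLED", "true").strip().lower()
    enabled = raw_enabled in {"1", "true", "yes"}

    port = int(env.get("EVENTKIT_METRICS_PORT", "9090"))
    return Settings(event_store=store, metrics_enabled=enabled, metrics_port=port)
```

With an empty environment the sketch yields the documented defaults (`gcs`, metrics enabled, port 9090), and a leftover `firestore` value surfaces as an error instead of a silent fallback.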

Related Issues

Fixes #25

- Add comprehensive E2E test script (scripts/e2e_test.sh)
  - Tests full pipeline: API → Queue → RingBuffer → EventLoader → GCS
  - Uses GCS emulator via Docker
  - Verifies Parquet files in Hive-style partitions
  - All checks passing ✓

- Fix test warnings (0 warnings now)
  - Replace AsyncMock with simpler async mock in ring buffer tests
  - Change asyncio.get_event_loop() to asyncio.new_event_loop() to avoid the DeprecationWarning that Python 3.12 emits when no current event loop exists

- Clean up docker-compose.yml
  - Remove Firestore emulator (no longer used)
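The event-loop change in the test-warning fix can be illustrated as follows. This is a sketch only: how the tests actually create and dispose of their loop is not shown in this PR, and `run_in_fresh_loop` and `sample` are hypothetical names.

```python
import asyncio


def run_in_fresh_loop(coro):
    """Run a coroutine on a newly created loop and always close the loop afterwards."""
    # asyncio.new_event_loop() never relies on an implicit "current" loop,
    # so it does not trigger the get_event_loop() deprecation warning.
    loop = asyncio.new_event_loop()
    try:
        return loop.run_until_complete(coro)
    finally:
        loop.close()


async def sample():
    await asyncio.sleep(0)
    return "done"
```

Creating the loop explicitly also makes ownership clear: whoever creates it is responsible for closing it.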

All 232 unit tests passing with 0 warnings.
E2E test verifies full event pipeline end-to-end.
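For readers unfamiliar with Hive-style partitions: each partition column becomes a `key=value` directory in the object path. The sketch below shows the idea under assumed partition columns (`date` and `hour`) and an assumed file name; the columns eventkit actually uses are not stated in this PR.

```python
from datetime import datetime, timezone
from pathlib import PurePosixPath


def partition_path(prefix, ts, filename):
    """Build a Hive-style object path with one key=value directory per column."""
    # Assumption: partitions are date and hour; substitute the real columns.
    return str(
        PurePosixPath(prefix) / f"date={ts:%Y-%m-%d}" / f"hour={ts:%H}" / filename
    )


def parse_partitions(path):
    """Return the key=value directory segments of an object path as a dict."""
    directories = PurePosixPath(path).parts[:-1]  # drop the file name
    return dict(part.split("=", 1) for part in directories if "=" in part)


ts = datetime(2026, 1, 14, 9, 30, tzinfo=timezone.utc)
path = partition_path("events", ts, "part-0000.parquet")
```

A verification step like the one in the E2E script could list objects under the prefix and check that `parse_partitions` returns the expected keys for each Parquet file.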
- Add prometheus-client dependency
- Create core metrics infrastructure (src/eventkit/metrics.py)
  - Custom registry
  - Version info metric
  - Uptime gauge (auto-updated every 10s)
  - Component health gauges
  - Metrics HTTP server on port 9090

- Add metrics modules for all components:
  - API metrics (requests, duration)
  - Event processing metrics (received, processed, failed)
  - Storage metrics (files, bytes, operations, duration)
  - Queue metrics (enqueued, dequeued, depth)
  - Ring buffer metrics (written, published, size)
  - Warehouse loader metrics (discovered, loaded, pending)

- Instrument key components:
  - API middleware tracks all HTTP requests
  - Processor tracks event flow (received, then processed or failed)
  - GCSEventStore tracks storage operations

- Config: EVENTKIT_METRICS_ENABLED and EVENTKIT_METRICS_PORT
- Tests: 238 passing (6 new metrics tests)
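To show what "a metrics server on its own port" means in practice, here is a standard-library stand-in. It is not the PR's implementation, which uses prometheus-client with a custom registry. The metric name `eventkit_uptime_seconds` and the helper names are assumptions, and uptime is computed at scrape time here rather than refreshed every 10 seconds.

```python
import threading
import time
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

START = time.monotonic()


def render_metrics():
    """Render one gauge in the Prometheus text exposition format."""
    uptime = time.monotonic() - START
    return (
        "# HELP eventkit_uptime_seconds Seconds since process start.\n"
        "# TYPE eventkit_uptime_seconds gauge\n"
        f"eventkit_uptime_seconds {uptime:.3f}\n"
    )


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics().encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep scrape traffic out of the application log


def start_metrics_server(port=9090, host="127.0.0.1"):
    """Serve /metrics from a daemon thread, separate from the API listener."""
    server = ThreadingHTTPServer((host, port), MetricsHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

Because the listener is separate, scrapes do not pass through the API middleware and therefore do not inflate the API request metrics.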

Design principles:
- Counter-focused (prefer counters over gauges)
- Low cardinality labels (no unbounded values)
- Naming: eventkit_{verb_noun}_{unit}_{suffix}
- Separate metrics server (port 9090) isolates monitoring traffic
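The low-cardinality principle usually comes down to normalizing raw values before they become label values. The following is a rough sketch; the route allow-list and helper names are hypothetical and not taken from eventkit.

```python
# Assumption: a small, fixed set of route templates; everything else is "other".
KNOWN_ROUTES = ("/collect", "/health")


def status_class(status_code):
    """Collapse an HTTP status code into a class such as '2xx' or '4xx'."""
    if 100 <= status_code <= 599:
        return f"{status_code // 100}xx"
    return "unknown"


def route_label(path, known_routes=KNOWN_ROUTES):
    """Map a raw request path to a bounded label value."""
    # Drop the query string and trailing slash so variants share one label.
    normalized = path.split("?", 1)[0].rstrip("/") or "/"
    return normalized if normalized in known_routes else "other"
```

With this approach the number of label combinations is bounded by the allow-list size times the number of status classes, regardless of what clients send.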

Next: Instrument queue/ring buffer, add documentation
- Ring buffer metrics:
  - Track writes, publishes (success/failure), and mark-published operations
  - Track total size and unpublished count (gauges)
  - Auto-update size metrics on write/mark operations

- Ring buffer publisher:
  - Track successful/failed publish attempts

- Async queue metrics:
  - Track enqueue/dequeue operations
  - Track processing success/failure
  - Track queue depth per worker (gauge)
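The split between counters (enqueue, dequeue, success, failure) and a gauge (depth) can be sketched with a thin wrapper around `asyncio.Queue`. Plain integers stand in for the Prometheus metrics here, and `InstrumentedQueue` is a hypothetical name, not the PR's AsyncQueue.

```python
import asyncio
from collections import Counter


class InstrumentedQueue:
    """asyncio.Queue wrapper that counts operations (stand-in for real metrics)."""

    def __init__(self, maxsize=0):
        self._queue = asyncio.Queue(maxsize)
        self.counts = Counter()  # only ever incremented, like Prometheus counters

    async def enqueue(self, event):
        await self._queue.put(event)
        self.counts["enqueued"] += 1

    async def dequeue(self):
        event = await self._queue.get()
        self.counts["dequeued"] += 1
        return event

    def record_result(self, ok):
        self.counts["processed" if ok else "failed"] += 1

    @property
    def depth(self):
        """Gauge-style value: it rises and falls, so it is read rather than accumulated."""
        return self._queue.qsize()


async def demo():
    queue = InstrumentedQueue()
    for i in range(3):
        await queue.enqueue({"id": i})
    await queue.dequeue()
    queue.record_result(ok=True)
    return dict(queue.counts), queue.depth
```

Depth is the one value here that cannot be a counter, which fits the "prefer counters over gauges" principle: gauges are used only where the quantity genuinely moves in both directions.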

All 238 tests passing. Metrics are now wired throughout the pipeline:
API → Queue → RingBuffer → Processor → Storage
Metrics server runs on separate port (9090) from API (8000).
E2E test now verifies:
- Prometheus format
- API request metrics
- Event processing metrics
- Storage metrics
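A check like the E2E verification above can be approximated by parsing the text exposition format and looking for expected metric families. This is a sketch under assumptions: the metric names and prefixes in `SAMPLE` are illustrative and are not confirmed names from eventkit.

```python
SAMPLE = (
    "# HELP eventkit_api_requests_total Total API requests.\n"
    "# TYPE eventkit_api_requests_total counter\n"
    'eventkit_api_requests_total{method="POST",status="2xx"} 3.0\n'
    "eventkit_uptime_seconds 12.5\n"
)


def metric_names(exposition_text):
    """Return the set of sample names found in Prometheus text-format output."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # A sample line is `name{labels} value` or `name value`.
        names.add(line.split("{", 1)[0].split(" ", 1)[0])
    return names


def missing_metrics(exposition_text, required_prefixes):
    """List the required name prefixes that no exported sample matches."""
    found = metric_names(exposition_text)
    return [
        prefix
        for prefix in required_prefixes
        if not any(name.startswith(prefix) for name in found)
    ]
```

An E2E script could fetch `/metrics` after sending a few events and fail if `missing_metrics` returns a non-empty list.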
Document all Prometheus metrics:
- API layer (requests, latency)
- Event processing (received, processed, failed)
- Storage (bytes, files written)
- Queue/ring buffer (depth, enqueue, dequeue)
- Warehouse loader (files processed, errors)
- System (uptime, health, version)

Includes:
- Metrics server configuration
- Example Grafana queries
- Design principles
- All available metrics with labels
Track PubSub-specific operations:
- Ring buffer enqueue (before Pub/Sub publish)
- Pub/Sub publish success/failure
- Messages received from subscription
- Ack/nack with reasons (success, decode_error, processing_error, no_loop)
- Queue depth per worker

Labels match AsyncQueue for consistency:
- queue_mode="pubsub" or "pubsub_published"
- result="ack_success", "nack_processing_error", etc.
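One way to keep the `result` label bounded is to derive it from a fixed set of reasons. The sketch below assumes the label is built as `ack_` or `nack_` plus the reason. That matches the two examples given above (`ack_success`, `nack_processing_error`) but is otherwise an inference, and the names are hypothetical.

```python
from enum import Enum


class AckReason(str, Enum):
    SUCCESS = "success"
    DECODE_ERROR = "decode_error"
    PROCESSING_ERROR = "processing_error"
    NO_LOOP = "no_loop"


def result_label(reason):
    """Build the `result` label value: ack on success, nack for every other reason."""
    reason = AckReason(reason)  # raises ValueError for anything outside the fixed set
    action = "ack" if reason is AckReason.SUCCESS else "nack"
    return f"{action}_{reason.value}"
```

Routing every label value through the enum means an unexpected reason fails loudly in tests instead of silently creating a new time series.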

All queue modes now fully instrumented for production observability.
The ring buffer integration test relied on Firestore, which was removed.
Coverage for ring buffer → queue → storage flow is now provided by:
- Unit tests for individual components
- E2E test script (scripts/e2e_test.sh) with GCS emulator
@prosdev prosdev merged commit 1330c2b into main Jan 15, 2026
2 checks passed